There are a number of optical illusions that are frequently cited as demonstrating that our senses do not always accurately report the reality of the world around us. For instance, Fig. 6(a) shows a diagonal line that is interrupted by two parallel vertical lines. The two halves of the diagonal appear to be misaligned. Part (b) shows the Necker cube (familiar to quilters as "tumbling blocks"), which can be seen in two ways: either as jutting up and to the right or as receding back and to the left. Interestingly, it is usually possible to perceive this shape either way, but not both simultaneously. The figure in (c) appears to be a forked object. At the left it appears to have two prongs, while at the right it has three. Most people "see" a triangular shape in part (d), even though in reality there are only three Pac-Man-like partial circles. Part (e) shows Penrose's "impossible tribar," which appears to be a triangular solid built from three right angles. Even though geometry proves that the angles of a triangle must total 180 degrees, it still looks as if the tribar could exist in three-dimensional space. The ever-ascending stairway in (f) was made famous by the graphic artist M. C. Escher in his prints "Ascending and Descending" and in the enigmatic "Waterfall."
Less well known, but just as fundamental as such visual tricks, are everyday aspects of audio perception that are "illusions" in the sense that what we perceive is very different from the reality in the physical world surrounding us. For example, if I were to say "think of a steady, continuous, unchanging sound like the tone of an organ," you could most likely do so without difficulty. In reality, however, there is nothing "steady" or "continuous" or "unchanging" about an organ sound. The physical reality is that an organ sound (like any sound) is an undulating wave with alternating regions of high and low pressure; the air molecules must be constantly wiggling back and forth. Were the pressure changes and the motions to cease, then so would the sound.
Because the undulations occur at a rate too fast to perceive separately, the auditory system "blurs" them together, achieving the illusion of steadiness. Thus sounds that occur closer together in time than the threshold of simultaneity (Fig. 5 places this at about 1 ms) are merged into a "single sound." This parallels the everyday illusion that television (and movies) show continuous action; in reality, movies consist of a sequence of still photos shown at a rate faster than the threshold of simultaneity for vision, which is about 20 Hz. Closely related to the illusion of continuity is the illusion of simultaneity: sounds appear to occur at the same instant even though in reality they do not. The ear tends to bind together events that occur close to each other in time and to clearly separate others that may be only slightly further apart.
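This merging rule can be caricatured in code. The sketch below, a hypothetical illustration only, clusters event onset times that fall within a fixed 1 ms threshold; real auditory grouping depends on far more than raw time differences.

```python
def merge_events(onsets, threshold=0.001):
    """Cluster event onset times (in seconds) that fall within the
    ~1 ms threshold of simultaneity into single perceived events.
    Hypothetical sketch: the actual auditory system uses many cues
    beyond raw time differences."""
    ts = sorted(onsets)
    groups = [[ts[0]]]
    for t in ts[1:]:
        if t - groups[-1][-1] < threshold:
            groups[-1].append(t)   # merged: heard as one sound
        else:
            groups.append([t])     # separated: heard as distinct sounds
    return groups

# Two clicks 0.5 ms apart merge; a third click 10 ms later stays separate.
merge_events([0.0, 0.0005, 0.01])
```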
Another common auditory "illusion" is based on the ear's propensity to categorize certain kinds of sounds. For example, it is easy to say the vowel "a" and to slowly (and "continuously") change it into the vowel "e." Thus there is a continuum of possible vowel sounds: "a" at one end, two thirds "a" and one third "e," one half "a" and one half "e," one third "a" and two thirds "e," continuing to a pure "e" at the other end. Yet no matter who is speaking, one never perceives the intermediate vowel sounds. The sound is automatically categorized by the auditory system into either "a" or "e," never into something in between. This is called categorical perception, and it has obvious importance in the comprehension of language.
While there are many similarities between visual and auditory perceptions, there are also significant differences. For example, in 1886 Mach demonstrated that spatial symmetry is directly perceptible to the eye, whereas temporal symmetry is not directly perceptible to the ear. And unlike visual perception, the human ability to parse musical rhythms inherently involves the measurement of time intervals.
Loosely speaking, pitch is the perceptual analog of frequency. Acousticians define pitch formally as "that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high." Sine waves, for which the frequency is clearly defined, have unambiguous pitches because everyone orders them the same way from low to high. For non-sine wave sounds, such an ordering can be accomplished by comparing the sound with unknown pitch to sine waves of various frequencies. The pitch of the sinusoid that most closely matches the unknown sound is then said to be the pitch of that sound.
Pitch determinations are straightforward when working with musical instruments that have a clear fundamental frequency and harmonic overtones. When there is no discernible fundamental, however, the ear will often create one. Such virtual pitch occurs when the perceived pitch of the sound is not the same as the pitch of any of its overtones. This is shown on the Auditory Demonstrations CD, where the "Westminster Chimes" song is played using only upper harmonics. In one demonstration, the sounds have spectra like that shown in Fig. 7. This particular note has partials at 780, 1040, and 1300 Hz, which is clearly not a harmonic series. These partials are, however, closely related to a harmonic series with fundamental at 260 Hz, because the lowest partial is 3 × 260 = 780 Hz, the middle partial is 4 × 260 = 1040 Hz, and the highest partial is 5 × 260 = 1300 Hz. The ear recreates the missing fundamental, and this perception is strong enough to support the playing of melodies, even when the particular harmonics used to generate the sound change from note to note. Thus the ear can create pitches even when there is no stimulus at the frequency of the corresponding sinusoid. This is somewhat analogous to the "triangle" that is visible in Fig. 6(d).
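The arithmetic behind the recreated fundamental is easy to check: for integer partial frequencies, the candidate fundamental is their greatest common divisor. (This is a simplification; the ear's actual pitch estimate tolerates mistuned partials, which a strict gcd does not.) A minimal sketch:

```python
from functools import reduce
from math import gcd

def missing_fundamental(partials):
    """Largest frequency (Hz) of which every partial is an integer
    multiple. For the spectrum of Fig. 7 the ear 'hears' this
    frequency even though the sound contains no energy there.
    Simplified model: assumes exact integer partial frequencies."""
    return reduce(gcd, partials)

# 780 = 3*260, 1040 = 4*260, 1300 = 5*260
missing_fundamental([780, 1040, 1300])
```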
Perhaps the most striking demonstration that pitch is a product of the mind and not of the physical world is Shepard's ever rising tone, which is an auditory analog of the ever-ascending staircase in Fig. 6(f). Sound example EverRise presents an organ-like timbre that is constructed as diagrammed in Fig. 8. The sound ascends chromatically up the scale: after ascending one full octave, it has returned to its starting point and ascends again. The perception is that the tone rises forever (this version is about 5 minutes long) even though it never actually leaves a single octave!
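The essence of the construction in Fig. 8 can be sketched in a few lines: each note consists of octave-spaced partials whose amplitudes follow a fixed bell curve over log-frequency, so the highest partials fade out of awareness just as new lowest partials fade in. The parameter values below are illustrative assumptions, not those of sound example EverRise.

```python
import math

def shepard_tone(pitch_class, dur=0.5, sr=8000, num_octaves=8, low=20.0):
    """One note of a Shepard tone: octave-spaced partials weighted by a
    bell curve over log-frequency. pitch_class counts semitones (0-11)
    above the root of the lowest octave; after 12 steps the partial
    structure has wrapped around to (nearly) where it began.
    Illustrative sketch; parameter values are assumptions."""
    base = low * 2 ** (pitch_class / 12)       # frequency in the lowest octave
    center = math.log2(low) + num_octaves / 2  # peak of the loudness bell
    partials = []
    for k in range(num_octaves):
        f = base * 2 ** k
        if f >= sr / 2:                        # stay below the Nyquist limit
            break
        amp = math.exp(-0.5 * ((math.log2(f) - center) / 1.5) ** 2)
        partials.append((f, amp))
    n = int(sr * dur)
    return [sum(a * math.sin(2 * math.pi * f * i / sr) for f, a in partials)
            for i in range(n)]
```

Concatenating `shepard_tone(s % 12)` for s = 0, 1, 2, ... yields a scale that ascends chromatically forever while never leaving a single octave.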
The ear's job (to engage in some introductory chapter anthropomorphization) is to make sense of the auditory world surrounding it. This is not an easy job, because sound consists of nothing more than ephemeral pressure waves embedded in a complex three dimensional world. The sound wave that arrives at the ear is a conglomeration of sound from all surrounding events merged together. The ear must unmerge, untangle, and interpret these events. The ear must capture the essence of what is happening in the world at large, simultaneously emphasizing the significant features and removing the trivial. Does that sound represent the distant rustling of a lion or the nearby bustling of a deer? The distinction may be of some importance.
Given this formidable task, the ear has developed a sophisticated multi-level strategy. In the first stage, collections of similar sense impressions are clustered into objects of perception called auditory events. This occurs on a very short time scale. At the second stage, auditory events that are similar in some way are themselves grouped together into larger chunks, to form patterns and categories that are most likely the result of learning and experience.
For example, the low level processing might decode a complex wave into an auditory event described by the phoneme "a." This represents a huge simplification because while there are effectively an infinite variety of possible waveshapes, there are only about 45 distinct phonemes (in English). At the next stage, successive phonemes are scanned and properly chunked together into words, which can then invoke various kinds of long term memory where something corresponding to meaning might be stored.
As another example, the low level processing might decode a complex waveform into an auditory event such as the performance of a musical "note" on a familiar instrument; the C of a flute. Again, this represents a simplification because there are only a few kinds of instruments while there are an infinite variety of waveforms. At the next stage, several such notes may be clustered to form a melodic or rhythmic pattern, again, condensing the information into simple and coherent clusters that can then be presented to long term memory and parsed for meaning.
Thus the ear's strategy involves simplification and categorization. A large amount of continuously variable data arrives; a (relatively) small amount of well categorized data leaves, to be forwarded to the higher processing centers. In the normal course of events, this strategy works extremely well. If, for instance, two sounds are similar (by beginning at the same time, by having a common envelope, by being modulated in a common way, by having a common period, by arriving from the same direction, etc.) then they are likely to be clustered into a single event. This makes sense because in the real world, having such similarities implies that they are likely to have arisen from the same source. This is the ear doing its job.
If we, as devious scientists, happen to separate out the cues associated with legitimate clustering and to manipulate them independently, then it should come as no surprise that we can "fool" the ear into perceiving "illusions." The pitch illusions are of exactly this kind. It would be a rare sound in the real world that would have multiple harmonically related partials yet have no energy at the frequency corresponding to their common period (such as occurs in Fig. 7). It would be an even rarer sound that spanned the complete audio range in such a way that the highest partials faded out of awareness exactly as the lowest partials entered.
Illusions show the limitations of our perceptual apparatus. Somewhat paradoxically, they are also helpful in distinguishing what is "really" in the world from what is "really" in our minds.
Consider two friends talking. It might appear that a tape recording of their conversation would contain all the information needed to understand the conversation. Indeed, you or I could listen to the recording, and, providing it was in a language we understood, reconstruct much of the meaning. But there is currently no computer that can do the same. Why? The answer is, at least in part, because the recording does not contain anywhere near "all" the information. There are two different levels at which it fails. First, the computer does not know English and lacks the cultural, social, and personal background that the two friends share. Second, it lacks the ability to parse and decode the audio signal into phonemes and then into words. Thus the computer fails at both the cognitive and the perceptual levels.
The same issues arise when attempting to automate the interpretation of a musical passage. What part of the music is in the signal, what part is in the perceptual apparatus of the listener, and what part is in the cognitive and/or cultural framework in which the music exists? Features of the music that operate at the cognitive level are unlikely to yield to automation because the required information is vast. Features that fundamentally involve perceptual processing may yield to computer analysis if an appropriate way to pre-process the signal in an analogous fashion can be found. Only features that are primarily "in the signal" are easy. Illusions can help distinguish which parts of our sense impressions correspond directly to features of the world, and which do not. As will become clear, the things we call "notes," "beats," "melodies," "rhythms," and "meter" are objects of cognition or perception and not primary sense impressions; they are "illusions" in the mind of the listener and not intrinsic properties of the musical signal.